The mutate() function takes in the following arguments: the first argument is the dataframe of interest, and the second argument is a new or existing data variable that is defined in terms of other data variables.
I think of them like formulas in Excel.
[1] 4.634012 4.638653 4.032101 5.503031 3.713696 3.972693 3.235727 4.135042
[9] 9.017365 3.940167
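As a minimal sketch of the mutate() call pattern described above: the dataframe `toy_expression` and its column `Gene_Exp` are made-up names and values for illustration, not part of the course datasets.

```r
library(dplyr)

# Toy data: hypothetical names and values, not the course datasets
toy_expression <- data.frame(
  ModelID = c("A", "B", "C"),
  Gene_Exp = c(2, 4, 8)
)

# First argument: the dataframe. Second argument: new_variable = expression,
# where the expression is written in terms of existing columns.
toy_logged <- mutate(toy_expression, log_Gene_Exp = log(Gene_Exp))

# Using an existing column name on the left-hand side overwrites that column.
toy_halved <- mutate(toy_expression, Gene_Exp = Gene_Exp / 2)
```

Note that mutate() returns a new dataframe and leaves `toy_expression` unchanged, so the result needs to be assigned to a name if you want to keep it.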
One thing we often need to do is scale or normalize data. Let’s try scaling by the maximum value in a column. Try creating a new variable KRAS_scaled, calculated as:
Try calculating a new variable in mutation called TP53_CDKN2A, which is equal to:
TP53_Mut | CDKN2A_Mut
Then filter new_mutation by TP53_CDKN2A == TRUE. How many rows did you return?
The group_by() function returns the identical input dataframe but remembers which variable(s) have been marked as grouped.
The summarise() function (you can also spell it summarize()) returns one row for each combination of grouping variables, and one column for each of the summary statistics that you have specified.
Functions you can use for summarise() must take in a vector and return a simple data type, such as any of our summary statistics functions: mean(), median(), min(), max(), etc.
The exception is n(), which takes no arguments and returns the number of rows in each group.
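Here is a small sketch of how group_by() and summarise() fit together, including n(). The dataframe `toy_metadata` and its columns are hypothetical stand-ins, not the course's metadata table.

```r
library(dplyr)

# Toy data: hypothetical names and values
toy_metadata <- data.frame(
  Lineage = c("Lung", "Lung", "Skin", "Skin", "Skin"),
  Age = c(60, 70, 20, 30, 40)
)

toy_summary <- toy_metadata %>%
  group_by(Lineage) %>%            # mark Lineage as the grouping variable
  summarise(
    mean_age = mean(Age),          # one value per group
    n_lines = n()                  # number of rows in each group
  )

toy_summary
```

The result has one row per value of `Lineage` (two rows here) and one column per summary statistic, in addition to the grouping column.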
First group by OncotreeLineage. Then try calculating the maximum age as max_age within metadata when it is grouped by OncotreeLineage:
Our summarization has a lot of NAs. We can use the na.rm argument in max() to drop NA values in our calculation.
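To see what na.rm changes, here is a sketch on made-up data (the dataframe `toy_ages` and its values are assumptions for illustration). The same summary is computed with and without na.rm = TRUE.

```r
library(dplyr)

# Toy data: hypothetical values, with one missing age in the Lung group
toy_ages <- data.frame(
  Lineage = c("Lung", "Lung", "Skin", "Skin"),
  Age = c(60, NA, 20, 30)
)

toy_max <- toy_ages %>%
  group_by(Lineage) %>%
  summarise(
    max_age_with_na = max(Age),          # NA if any value in the group is NA
    max_age = max(Age, na.rm = TRUE)     # NA values dropped before taking the max
  )

toy_max
```

One caveat: if every value in a group is NA, max() with na.rm = TRUE has nothing left to work with, so it returns -Inf with a warning.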
First group by OncotreeLineage. Then try calculating the median age as median_age within metadata when it is grouped by OncotreeLineage. (Hint: use the median() function. Bonus points if you use the na.rm argument)
What is the difference between these two pipelines?
arrange()

If we need to sort by a column, we can use arrange():
If we want to sort by descending order, then we can wrap our column in desc() (short for descending):
We can sort by multiple columns by passing in multiple variables:
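The three sorting patterns above can be sketched as follows. The dataframe `toy_lines` and its values are hypothetical, chosen so the effect of each call is easy to check by eye.

```r
library(dplyr)

# Toy data: hypothetical names and values
toy_lines <- data.frame(
  ModelID = c("m1", "m2", "m3", "m4"),
  Lineage = c("Skin", "Lung", "Skin", "Lung"),
  Age = c(30, 60, 20, 70)
)

# Ascending sort on one column
by_age <- arrange(toy_lines, Age)

# Descending sort: wrap the column in desc()
by_age_desc <- arrange(toy_lines, desc(Age))

# Multiple columns: sort by Lineage first, then by Age within each Lineage
by_lineage_then_age <- arrange(toy_lines, Lineage, Age)
```

With multiple columns, later columns only break ties left by the earlier ones.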
Try sorting by desc(Age) and OncotreeLineage, and then by OncotreeLineage and desc(Age). Does order matter?
metadata
| ModelID | OncotreeLineage | Age |
|---|---|---|
| “ACH-001113” | “Lung” | 69 |
| “ACH-001289” | “CNS/Brain” | NA |
| “ACH-001339” | “Skin” | 14 |
expression
| ModelID | PIK3CA_Exp | log_PIK3CA_Exp |
|---|---|---|
| “ACH-001113” | 5.138733 | 1.636806 |
| “ACH-001289” | 3.184280 | 1.158226 |
| “ACH-001339” | 3.165108 | 1.152187 |
I want to compare the relationship between OncotreeLineage and PIK3CA_Exp:
| ModelID | PIK3CA_Exp | log_PIK3CA_Exp | OncotreeLineage | Age |
|---|---|---|---|---|
| “ACH-001113” | 5.138733 | 1.636806 | “Lung” | 69 |
| “ACH-001289” | 3.184280 | 1.158226 | “CNS/Brain” | NA |
| “ACH-001339” | 3.165108 | 1.152187 | “Skin” | 14 |
One strategy we can use is to only merge on common ids between the two tables. If one table has ids that aren’t in the second table, then we’ll remove those rows.
We see that in both dataframes, the rows (observations) represent cell lines with a common column ModelID, so let’s merge these two dataframes together, using inner_join():
Let’s take a look at the dimensions:
inner_join() keeps all observations common to both dataframes based on the common column defined via the by argument.
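As a sketch of this merge, the code below rebuilds only the three rows shown in the tables above (the names `metadata_head` and `expression_head` are made up here; the full course dataframes have more rows) and joins them on ModelID.

```r
library(dplyr)

# Only the three rows displayed above, not the full datasets
metadata_head <- data.frame(
  ModelID = c("ACH-001113", "ACH-001289", "ACH-001339"),
  OncotreeLineage = c("Lung", "CNS/Brain", "Skin"),
  Age = c(69, NA, 14)
)
expression_head <- data.frame(
  ModelID = c("ACH-001113", "ACH-001289", "ACH-001339"),
  PIK3CA_Exp = c(5.138733, 3.184280, 3.165108),
  log_PIK3CA_Exp = c(1.636806, 1.158226, 1.152187)
)

# Keep only rows whose ModelID appears in both dataframes
merged_head <- inner_join(expression_head, metadata_head, by = "ModelID")

dim(merged_head)  # number of rows, then number of columns
```

For these three rows every ModelID appears in both tables, so the merged result has 3 rows and 5 columns: the shared ModelID column, plus two columns from each table.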
Given xxx_join(x, y, by = "common_col"):
full_join() keeps all observations.
left_join() keeps all observations in x.
right_join() keeps all observations in y.
inner_join() keeps observations common to both x and y.
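To compare the four joins side by side, here is a sketch on two made-up dataframes, `x` and `y`, whose `id` columns only partly overlap. All names and values are hypothetical.

```r
library(dplyr)

# Toy data: ids "b" and "c" are shared; "a" is only in x; "d" and "e" only in y
x <- data.frame(id = c("a", "b", "c"), x_val = c(1, 2, 3))
y <- data.frame(id = c("b", "c", "d", "e"), y_val = c(20, 30, 40, 50))

join_rows <- c(
  inner = nrow(inner_join(x, y, by = "id")),  # ids in both x and y: b, c
  left  = nrow(left_join(x, y, by = "id")),   # all ids in x: a, b, c
  right = nrow(right_join(x, y, by = "id")),  # all ids in y: b, c, d, e
  full  = nrow(full_join(x, y, by = "id"))    # every id: a, b, c, d, e
)

join_rows
```

Rows without a match in the other table are kept by left_join(), right_join(), and full_join(), with NA filled in for the columns that come from the other table. These row counts assume each id appears at most once per table; duplicated ids would produce additional rows.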
Try out inner_join() on the two tables and compare the number of rows in the merged table.
Now try full_join() on the two tables and compare the number of rows in the merged table.
Learning Community Week!
Suggest topics here: